Syntactic Re-Alignment Models for Machine Translation
نویسندگان
چکیده
We present a method for improving word alignment for statistical syntax-based machine translation that employs a syntactically informed alignment model closer to the translation model than commonly-used word alignment models. This leads to extraction of more useful linguistic patterns and improved BLEU scores on translation experiments in Chinese and Arabic. 1 Methods of statistical MT Roughly speaking, there are two paths commonly taken in statistical machine translation (Figure 1). The idealistic path uses an unsupervised learning algorithm such as EM (Demptser et al., 1977) to learn parameters for some proposed translation model from a bitext training corpus, and then directly translates using the weighted model. Some examples of the idealistic approach are the direct IBM word model (Berger et al., 1994; Germann et al., 2001), the phrase-based approach of Marcu and Wong (2002), and the syntax approaches of Wu (1996) and Yamada and Knight (2001). Idealistic approaches are conceptually simple and thus easy to relate to observed phenomena. However, as more parameters are added to the model the idealistic approach has not scaled well, for it is increasingly difficult to incorporate large amounts of training data efficiently over an increasingly large search space. Additionally, the EM procedure has a tendency to overfit its training data when the input units have varying explanatory powers, such as variable-size phrases or variable-height trees. The realistic path also learns a model of translation, but uses that model only to obtain Viterbi wordfor-word alignments for the training corpus. The bitext and corresponding alignments are then used as input to a pattern extraction algorithm, which yields a set of patterns or rules for a second translation model (which often has a wider parameter space than that used to obtain the word-for-word alignments). Weights for the second model are then set, typically by counting and smoothing, and this weighted model is used for translation. Realistic approaches scale to large data sets and have yielded better BLEU performance than their idealistic counterparts, but there is a disconnect between the first model (hereafter, the alignment model) and the second (the translation model). Examples of realistic systems are the phrase-based ATS system of Och and Ney (2004), the phrasal-syntax hybrid system Hiero (Chiang, 2005), and the GHKM syntax system (Galley et al., 2004; Galley et al., 2006). For an alignment model, most of these use the Aachen HMM approach (Vogel et al., 1996), the implementation of IBM Model 4 in GIZA++ (Och and Ney, 2000) or, more recently, the semi-supervised EMD algorithm (Fraser and Marcu, 2006). The two-model approach of the realistic path has undeniable empirical advantages and scales to large data sets, but new research tends to focus on development of higher order translation models that are informed only by low-order alignments. We would like to add the analytic power gained from modern translation models to the underlying alignment model without sacrificing the efficiency and empirical gains of the two-model approach. By adding the 360 u n s u p e r v i s e d
منابع مشابه
Combining Models for the Alignment of Parallel Syntactic Trees
The alignment of syntactic trees is the task of aligning the internal and leaf nodes of two sentences in different languages structured as trees. The output of the alignment can be used, for instance, as knowledge resource for learning translation rules (for rule-based machine translation systems) or models (for statistical machine translation systems). This paper presents some experiments carr...
متن کاملRe-structuring, Re-labeling, and Re-aligning for Syntax-Based Machine Translation
This article shows that the structure of bilingual material from standard parsing and alignment tools is not optimal for training syntax-based statistical machine translation (SMT) systems. We present three modifications to the MT training data to improve the accuracy of a state-of-theart syntax MT system: re-structuring changes the syntactic structure of training parse trees to enable reuse of...
متن کاملStatistical Machine Translation of Parliamentary Proceedings Using Morpho-Syntactic Knowledge
This paper presents an overview of the University of Washington statistical machine translation system developed for the 2006 TCSTAR evaluation campaign. We use a statistical phrase-based system with multiple decoding passes and a log-linear probability model. Our main focus was on exploring the possibility of using morpho-syntactic knowledge (lemmas and part-of-speech tags) for word alignment,...
متن کاملHybrid System Combination for Machine Translation: An Integration of Phrase-level and Sentence-level Combination Approaches
Hybrid System Combination for Machine Translation: An Integration of Phrase-level and Sentence-level Combination Approaches Wei-Yun Ma Given the wide range of successful statistical MT approaches that have emerged recently, it would be beneficial to take advantage of their individual strengths and avoid their individual weaknesses. Multi-Engine Machine Translation (MEMT) attempts to do so by ei...
متن کاملLoosely Tree-Based Alignment for Machine Translation
We augment a model of translation based on re-ordering nodes in syntactic trees in order to allow alignments not conforming to the original tree structure, while keeping computational complexity polynomial in the sentence length. This is done by adding a new subtree cloning operation to either tree-to-string or tree-to-tree alignment algorithms.
متن کاملTwo Memory-Based Methods for Phrase Alignment
This document presents two bilingual phrase-based alignment methods handling syntactic constituents (sub-sentential components) of parallel sentences. The methods rely on an asymmetrical parsing of both languages: Light part-of-speech tagging for the target language, syntactic tree building for the ’source’ language and the complexity of each is studied. One of their benefits is that they do no...
متن کامل